ABSTRACT

PageRank has become a key element in the success of search
engines, allowing them to rank the most important hits in the top
screen of results. One key aspect that distinguishes PageR- 
ank from other prestige measures such as in-degree is its 
global nature. From the information provider perspective, 
this makes it difficult or impossible to predict how their 
pages will be ranked. Consequently a market has emerged 
for the optimization of search engine results. Here we study 
the accuracy with which PageRank can be approximated by 
in-degree, a local measure made freely available by search 
engines. Theoretical and empirical analyses lead us to conclude
that, given the weak degree correlations in the Web
link graph, the approximation can be relatively accurate, 
giving service and information providers an effective new 
marketing tool. 

Categories and Subject Descriptors 

H.3.3 [Information Storage and Retrieval]: Information
Search and Retrieval; H.3.4 [Information Storage
and Retrieval]: Systems and Software — Information net- 
works; H.3.5 [Information Storage and Retrieval]: On- 
line Information Services — Commercial, Web-based services; 
K.4.m [Computers and Society]: Miscellaneous 

General Terms 

Economics, Measurement 

Keywords 

Search engine optimization, PageRank, in-degree, mean field 
approximation, rank prediction. 

1. INTRODUCTION

PageRank has become a key element in the success of
Web search engines, allowing them to rank the most important
hits in the top page of results. Certainly the introduction of
PageRank as a factor in sorting results has contributed
considerably to Google's lasting dominance in the search 
engine market ISj. 

But PageRank is not the only possible measure of impor- 
tance or prestige among Web pages. The simplest possible 
way to measure the prestige of a page is to count the in- 
coming links (in-links) to the page. The number of in-links 
(in-degree) is the number of citations that a page receives 
from the other pages, so there is a correlation between in- 
degree and quality, especially when the in-degree is large. 
The in-degree of Web pages is very cheap to compute and 
maintain, so that a search engine can easily keep in-degree 
updated with the evolution of the Web. 

However, in-degree is a local measure. All links to a page 
are considered equal, regardless of where they come from. 
Two pages with the same in-degree are considered equally 
important, even if one is cited by much more prestigious 
sources than the other. To modulate the prestige of a page 
with that of the pages pointing to it means to move from 
the examination of an individual node in the link graph to 
that of the node together with its predecessor neighbors. 
PageRank represents such a shift from the local measure 
given by in-degree toward a global measure where each Web 
page contributes to define the importance of every other 
page. 

From the information provider perspective, the global na- 
ture of PageRank makes it difficult or impossible to predict 
how a new page will be ranked. Yet it is vital for many 
service and information providers to have good rankings 
by major search engines for relevant keywords, given that 
search engines are the primary way that Internet users find 
and visit Web sites [17, 12]. This situation makes PageRank
a valuable good controlled by a few major search engines.
Consequently a demand has emerged for companies
that perform so-called search engine optimization or search
engine marketing on behalf of business clients. The goal is
to increase the rankings of their pages, thus directing traffic
to their sites [14]. Search engine marketers have partial
knowledge of how search engines rank pages. They have access
to undocumented tools to measure PageRank, such as
the Google toolbar. Through experience and empirical tests
they can reverse-engineer some important ranking factors.
However, from inspecting the hundreds of bulletin boards
and blogs maintained by search engine marketers it is evident
that their work is largely guided by guesswork and trial and
error. Nevertheless, search engine optimization has grown to
be a healthy industry, as illustrated by a recent study [7].
Search engine marketing has even assumed ethical and legal
ramifications, as a sort of arms race has ensued between
marketers who want to increase their clients' rankings and
search engines that want to maintain the integrity of their
systems. The term search engine spam refers to those means
of promoting Web sites that search engines deem unethical
and worthy of blocking [6].

The status quo described above relies on two assumptions: 
(i) PageRank is a quantitatively different and better pres- 
tige measure compared to in-degree; and (ii) PageRank can- 
not be easily guessed or approximated by in-degree. There 
seems to be plenty of anecdotal and indirect evidence in sup- 
port of these assumptions — for example the popularity of 
PageRank — but little quantitative data to validate them. 
To wit, Amento et al. [5] report a very high average correlation
between in-degree and PageRank (Spearman $\rho = 0.93$,
Kendall $\tau = 0.83$) based on five queries. Further, they report
the same average precision at 10 (60%) for the two measures,
based on relevance assessments by human subjects. In this paper
we explore these assumptions quantitatively by answering
the following questions: What is the correlation between
in-degree and PageRank across representative samples of the
Web? How accurately can one approximate PageRank from
local knowledge of in-degree?

From the definition of PageRank, other things being equal, 
the PageRank of a page grows with the in-degree of the page. 
Beyond this zero-order approximation, the actual relation 
between PageRank and in-degree has not been thoroughly 
investigated in the past. It is known that the distributions
of PageRank and in-degree follow an almost identical pattern,
i.e., a curve ending with a broad tail that follows
a power law with exponent about 2.1. This fact may indicate
a strong correlation between the two variables. Surprisingly,
there is no agreement in prior literature about the correlation
between PageRank and in-degree. Pandurangan et
al. [11] show very little correlation based on analysis of the
Brown domain and the TREC WT10g collection. Donato et
al. [13] report a correlation coefficient which is basically
zero based on analysis of a much larger sample ($2 \cdot 10^8$ pages)
taken from the WebBase [16] collaboration. On the other
hand, analysis of the University of Notre Dame domain by
Nakamura [10] reveals a strong correlation.

In Section 2 we estimate PageRank for a generic directed
network within a mean field approach. We obtain a system 
of self-consistent relations for the average value of PageRank 
of all vertices with equal in-degree. For a network without 
degree-degree correlations the average PageRank turns out 
to be simply proportional to the in-degree, modulo an addi- 
tive constant. 

The prediction is validated empirically in Section 3, where
we solve the equations numerically for four large samples of
the Web graph; in each case the agreement between our
theoretical estimate and the empirical data is excellent. We
find that the Web graph is basically uncorrelated, so the
average PageRank for each degree class can be well approximated
by a linear function of the in-degree. As an additional
contribution we settle the issue of the correlation between
PageRank and in-degree; the linear correlation coefficient is
consistently large for all four samples we have examined, in
agreement with Nakamura [10]. We also calculate the size
of the fluctuations of PageRank about the average value
and find that the relative fluctuations decrease as the
in-degree increases, which means that our mean field estimate
becomes more accurate for important pages.

Our results suggest that we can approximate PageRank 
from in-degree. By deriving PageRank with our formula we 
can predict the rank of a page within a hit list by knowing 
its in-degree and the number of hits in the list. Section 4
reports on an empirical study of this prediction, performed by
submitting AltaVista ^ queries to the Google API 0. The 
actual ranks turn out to be scattered about the correspond- 
ing predictions. The implication of the fact that PageRank 
is mostly determined by in-degree is that it is possible to 
estimate the number of in-links that a new page needs in 
order to achieve a certain rank among all pages which deal 
with a specific topic. This provides search marketers — and 
information providers — with a new powerful tool to guide 
their campaigns. 

2. THEORETICAL ANALYSIS 

The PageRank $p(i)$ of a page $i$ is defined through the
following expression:

$$p(i) = \frac{q}{N} + (1-q) \sum_{j:\, j \to i} \frac{p(j)}{k_{out}(j)}, \qquad i = 1, 2, \ldots, N \tag{1}$$

where $N$ is the total number of pages, $j \to i$ indicates a
hyperlink from $j$ to $i$, $k_{out}(j)$ is the out-degree of page $j$, and
$1-q$ is the so-called damping factor. The set of Eqs. (1)
can be solved iteratively. From Eq. (1) it is clear that the
PageRank of a page grows with the PageRank of the pages
that point to it. However, the sum over predecessor neighbors
implies that PageRank also increases with the in-degree
of the page.

PageRank can be thought of as the stationary probability 
of a random walk process with additional random jumps. 
The physical description of the process is as follows: when a 
random walker is in a node of the network, at the next time 
step with probability q it jumps to a randomly chosen node 
and with probability 1 — q it moves to one of its successors 
with uniform probability. In the case of directed networks, 
there is the possibility that the node has no successors. In 
this case the walker jumps to a randomly chosen node of 
the network with probability one. The PageRank of a node 
$i$, $p(i)$, is then the probability of finding the walker at node $i$
when the process has reached the steady state, a condition 
that is always guaranteed by the jumping probability q. 

The probability of finding the walker at node $i$ at time step
$n$ follows a simple Markovian equation:

$$p_n(i) = \frac{q}{N} + (1-q) \sum_{j:\, k_{out}(j) \neq 0} \frac{a_{ji}}{k_{out}(j)}\, p_{n-1}(j) + \frac{1-q}{N} \sum_{j:\, k_{out}(j) = 0} p_{n-1}(j) \tag{2}$$

where $a_{ji}$ is the adjacency matrix, with entry 1 if there is a
direct connection from $j$ to $i$ and zero otherwise. The
first term in Eq. (2) is the contribution of walkers that decide
to jump to a randomly chosen node, the second term is the
random walk contribution, and the third term accounts for
walkers that at the previous step were located in dangling
points and now jump to random nodes. In the limit $n \to \infty$
this last contribution becomes a constant term affecting all
the nodes in the same way and thus it can be removed
from Eq. (2) under the constraint that the final solution is
properly normalized. Hereafter we will use this approach.
The PageRank of page $i$ is the steady state solution of Eq. (2),
$p(i) = \lim_{n \to \infty} p_n(i)$.

Equation (2) can be used as a numerical algorithm to compute
PageRank but, unfortunately, it is not possible to extract
analytical solutions from it. In the next subsection we
propose a mean field solution of Eq. (2) that, nevertheless,
gives a very accurate description of the PageRank structure
of the Web.
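As an illustration of the iterative scheme, the following minimal Python sketch applies the random walk update with jump probability $q$ to a small graph. The graph `toy_graph`, the function name `pagerank`, and the stopping tolerance are hypothetical choices made for this example only; walkers on dangling nodes are redistributed uniformly at each step, which keeps the vector normalized.

```python
def pagerank(out_links, q=0.15, tol=1e-10, max_iter=200):
    """Iterate the PageRank update on a graph given as {node: [successors]}."""
    nodes = sorted(set(out_links) | {i for succ in out_links.values() for i in succ})
    n = len(nodes)
    p = {v: 1.0 / n for v in nodes}  # uniform start, sums to one
    for _ in range(max_iter):
        # Probability mass sitting on dangling nodes (no successors).
        dangling = sum(p[v] for v in nodes if not out_links.get(v))
        # Random jump term plus uniform redistribution of dangling mass.
        new = {v: q / n + (1.0 - q) * dangling / n for v in nodes}
        # Random walk term: each node shares its mass among its successors.
        for j in nodes:
            succ = out_links.get(j, [])
            for i in succ:
                new[i] += (1.0 - q) * p[j] / len(succ)
        if sum(abs(new[v] - p[v]) for v in nodes) < tol:
            return new
        p = new
    return p


# Hypothetical four-page graph: "d" has no in-links, "c" collects most links.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_graph)
```

In this toy graph page "d" receives no links, so its PageRank reduces to the jump term $q/N$, while "c", which has the largest in-degree, obtains the largest value.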

2.1 Mean field analysis 

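Instead of analyzing the PageRank of single pages, we
aggregate pages in classes according to their degree
$\mathbf{k} = (k_{in}, k_{out})$ and define the average PageRank of nodes of
degree class $\mathbf{k}$ as

$$\bar{p}_n(\mathbf{k}) \equiv \frac{1}{N P(\mathbf{k})} \sum_{i \in \mathbf{k}} p_n(i) \tag{3}$$

where $P(\mathbf{k})$ is the fraction of nodes in degree class $\mathbf{k}$, so that
$N P(\mathbf{k})$ is the number of such nodes.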



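Note that now "degree class $\mathbf{k}$" means all the nodes with
in-degree $k_{in}$ and out-degree $k_{out}$. Taking the average of
Eq. (2) over all nodes of the degree class $\mathbf{k}$ we obtain

$$\frac{1}{N P(\mathbf{k})} \sum_{i \in \mathbf{k}} p_n(i) = \frac{q}{N} + \frac{1-q}{N P(\mathbf{k})} \sum_{i \in \mathbf{k}} \sum_{j} \frac{a_{ji}}{k_{out}(j)}\, p_{n-1}(j) \tag{4}$$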



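From Eq. (3) we see that the left-hand side of Eq. (4) is $\bar{p}_n(\mathbf{k})$.
On the right-hand side we split the sum over $j$ into two sums,
one over all the degree classes $\mathbf{k}'$ and the other over all the
nodes within each degree class $\mathbf{k}'$. We get

$$\bar{p}_n(\mathbf{k}) = \frac{q}{N} + \frac{1-q}{N P(\mathbf{k})} \sum_{\mathbf{k}'} \frac{1}{k'_{out}} \sum_{i \in \mathbf{k}} \sum_{j \in \mathbf{k}'} a_{ji}\, p_{n-1}(j) \tag{5}$$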

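At this point we perform our mean field approximation,
which consists in substituting the PageRank of the predecessor
neighbors of node $i$ by its mean value, that is,

$$\sum_{i \in \mathbf{k}} \sum_{j \in \mathbf{k}'} a_{ji}\, p_{n-1}(j) \simeq \bar{p}_{n-1}(\mathbf{k}') \sum_{i \in \mathbf{k}} \sum_{j \in \mathbf{k}'} a_{ji} = \bar{p}_{n-1}(\mathbf{k}')\, E_{\mathbf{k}' \to \mathbf{k}} \tag{6}$$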

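where $E_{\mathbf{k}' \to \mathbf{k}}$ is the total number of links pointing from
nodes of degree $\mathbf{k}'$ to nodes of degree $\mathbf{k}$. This matrix can
also be rewritten as

$$E_{\mathbf{k}' \to \mathbf{k}} = k_{in} P(\mathbf{k}) N\, \frac{E_{\mathbf{k}' \to \mathbf{k}}}{k_{in} P(\mathbf{k}) N} = k_{in} P(\mathbf{k}) N\, P_{in}(\mathbf{k}'|\mathbf{k}) \tag{7}$$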



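where $P_{in}(\mathbf{k}'|\mathbf{k})$ is the probability that a predecessor of a
node belonging to degree class $\mathbf{k}$ belongs to degree class $\mathbf{k}'$.
Using Eqs. (6) and (7) in Eq. (5) we finally obtain

$$\bar{p}_n(\mathbf{k}) = \frac{q}{N} + (1-q)\, k_{in} \sum_{\mathbf{k}'} \frac{P_{in}(\mathbf{k}'|\mathbf{k})}{k'_{out}}\, \bar{p}_{n-1}(\mathbf{k}') \tag{8}$$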



which is a closed set of equations for the average PageRank
of pages in the same degree class. When the network has
degree-degree correlations, the solution of this equation is
non-trivial and the resulting PageRank can have a complex
dependence on the degree. However, in the particular case
of uncorrelated networks the transition probability $P_{in}(\mathbf{k}'|\mathbf{k})$
becomes independent of the degree $\mathbf{k}$ and takes the simpler
form

$$P_{in}(\mathbf{k}'|\mathbf{k}) = \frac{k'_{out}\, P(\mathbf{k}')}{\langle k_{in} \rangle} \tag{9}$$



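Using this expression in Eq. (8), noting that
$\sum_{\mathbf{k}'} P(\mathbf{k}')\, \bar{p}(\mathbf{k}') = 1/N$ by normalization, and taking the
limit $n \to \infty$, we obtain

$$\bar{p}(\mathbf{k}) = \frac{q}{N} + \frac{1-q}{N}\, \frac{k_{in}}{\langle k_{in} \rangle} \tag{10}$$

that is, the average PageRank of nodes of degree class $\mathbf{k}$ is
independent of $k_{out}$ and, apart from the additive constant $q/N$,
proportional to $k_{in}$.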
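The closed formula for uncorrelated networks, $\bar{p} = q/N + (1-q)\,k_{in}/(N \langle k_{in} \rangle)$, is easy to evaluate. The short Python sketch below does so for a hypothetical toy in-degree sequence (the list `in_degrees` and the function name are illustrative assumptions, not data from our samples) and shows that the estimates sum to one over all pages, as a probability must.

```python
def mean_field_pagerank(k_in, n_pages, mean_k_in, q=0.15):
    """Mean field estimate of PageRank for a page with in-degree k_in."""
    return q / n_pages + (1.0 - q) * k_in / (n_pages * mean_k_in)


# Hypothetical toy in-degree sequence, one entry per page.
in_degrees = [0, 1, 1, 2, 3, 5]
n_pages = len(in_degrees)
mean_k_in = sum(in_degrees) / n_pages

estimates = [mean_field_pagerank(k, n_pages, mean_k_in) for k in in_degrees]
total = sum(estimates)  # equals q + (1 - q) = 1 up to rounding
```

A page without in-links gets only the jump contribution $q/N$, and every additional in-link adds the same increment $(1-q)/(N \langle k_{in} \rangle)$.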



2.2 Fluctuation analysis 

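The formalism presented in the previous subsection gives
a solution for the average PageRank of nodes of the same
degree class but it tells us nothing about how PageRank
is distributed within one degree class. To fill this gap, we
extend our mean field approach to the fluctuations within a
degree class. To this end, we start by taking the square
of Eq. (2):

$$p_n^2(i) = \frac{q^2}{N^2} + \frac{2q(1-q)}{N} \sum_{j} \frac{a_{ji}}{k_{out}(j)}\, p_{n-1}(j) + (1-q)^2 \sum_{j} \sum_{j'} \frac{a_{ji}\, a_{j'i}}{k_{out}(j)\, k_{out}(j')}\, p_{n-1}(j)\, p_{n-1}(j') \tag{11}$$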



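As in the previous calculation, we take the average over
degree classes of the square of PageRank and define

$$\overline{p_n^2}(\mathbf{k}) \equiv \frac{1}{N P(\mathbf{k})} \sum_{i \in \mathbf{k}} p_n^2(i) \tag{12}$$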
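Taking this average in Eq. (11) and rearranging terms we get

$$\begin{aligned}
\overline{p_n^2}(\mathbf{k}) = {} & \frac{q^2}{N^2} + \frac{2q(1-q)}{N}\, k_{in} \sum_{\mathbf{k}'} \frac{P_{in}(\mathbf{k}'|\mathbf{k})}{k'_{out}}\, \bar{p}_{n-1}(\mathbf{k}') + (1-q)^2\, k_{in} \sum_{\mathbf{k}'} \frac{P_{in}(\mathbf{k}'|\mathbf{k})}{k'^{\,2}_{out}}\, \overline{p_{n-1}^2}(\mathbf{k}') \\
& + (1-q)^2\, k_{in}(k_{in}-1) \sum_{\mathbf{k}'} \sum_{\mathbf{k}''} \frac{P_{in}(\mathbf{k}', \mathbf{k}''|\mathbf{k})}{k'_{out}\, k''_{out}}\, \bar{p}_{n-1}(\mathbf{k}')\, \bar{p}_{n-1}(\mathbf{k}'')
\end{aligned} \tag{13}$$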



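where we have used again the mean field approach. The
probability $P_{in}(\mathbf{k}', \mathbf{k}''|\mathbf{k})$ is the joint probability that a node
of degree $\mathbf{k}$ has simultaneously one predecessor of degree $\mathbf{k}'$
and another of degree $\mathbf{k}''$. We can make the further assumption
that this joint distribution factorizes as
$P_{in}(\mathbf{k}', \mathbf{k}''|\mathbf{k}) = P_{in}(\mathbf{k}'|\mathbf{k})\, P_{in}(\mathbf{k}''|\mathbf{k})$. In this case we can write an equation
for the standard deviation $\sigma_n(\mathbf{k})$ within a degree class, defined through
$\sigma_n^2(\mathbf{k}) = \overline{p_n^2}(\mathbf{k}) - \bar{p}_n^2(\mathbf{k})$, as follows:

$$\frac{\sigma_n^2(\mathbf{k})}{(1-q)^2\, k_{in}} = \sum_{\mathbf{k}'} \frac{P_{in}(\mathbf{k}'|\mathbf{k})}{k'^{\,2}_{out}} \left[ \sigma_{n-1}^2(\mathbf{k}') + \bar{p}_{n-1}^2(\mathbf{k}') \right] - \left[ \sum_{\mathbf{k}'} \frac{P_{in}(\mathbf{k}'|\mathbf{k})}{k'_{out}}\, \bar{p}_{n-1}(\mathbf{k}') \right]^2 \tag{14}$$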



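In the case of uncorrelated networks, this equation can be
analytically solved in the limit $n \to \infty$:

$$\sigma^2(\mathbf{k}) = \frac{(1-q)^2}{\langle k_{in} \rangle - (1-q)^2 \langle k_{in}/k_{out} \rangle} \left[ \left\langle \frac{\bar{p}^2(\mathbf{k}')}{k'_{out}} \right\rangle - \frac{1}{N^2 \langle k_{in} \rangle} \right] k_{in} \tag{15}$$

where $\langle \cdot \rangle$ denotes an average over the degree distribution
$P(\mathbf{k}')$ and $\bar{p}(\mathbf{k}')$ is given by Eq. (10).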



In the case of the Web, the heavy tail of the in-degree
distribution and the high average in-degree allow us to simplify
this expression as

$$\sigma^2(\mathbf{k}) \simeq \frac{(1-q)^4}{N^2 \langle k_{in} \rangle^3} \left\langle \frac{k_{in}^2}{k_{out}} \right\rangle k_{in} \tag{16}$$

For large in-degrees, the coefficient of variation is

$$\frac{\sigma(\mathbf{k})}{\bar{p}(\mathbf{k})} \simeq (1-q) \left[ \frac{\langle k_{in}^2/k_{out} \rangle}{\langle k_{in} \rangle\, k_{in}} \right]^{1/2} \tag{17}$$

The factor $\langle k_{in}^2/k_{out} \rangle$ in this expression can be very large if the
network is scale-free, which implies that the relative fluctuations
are large for small in-degrees. However, for large
in-degrees the relative fluctuations become less important,
due to the factor $k_{in}$ in the denominator, and the average
PageRank obtained in the previous subsection gives
a good approximation. This can be seen by analyzing the
coefficient of variation for the nodes with the maximum degree
$k_{in}^{max}$. Assuming that $k_{out}$ is weakly correlated with
$k_{in}$, the coefficient $\langle k_{in}^2/k_{out} \rangle$ scales with the maximum
in-degree as $(k_{in}^{max})^{3-\gamma_{in}}$ and the coefficient of variation as
$(k_{in}^{max})^{(2-\gamma_{in})/2}$, where $\gamma_{in}$ is the exponent of the in-degree
distribution. Since $\gamma_{in} > 2$, the relative fluctuations go
to zero. Then, for small in-degrees we expect PageRank to
be distributed according to a power law; for intermediate
in-degrees, according to a distribution peaked at the average
mean field value plus a power law tail; and for large
in-degrees, according to a Gaussian distribution centered
around the predicted mean field solution.
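To make the decay of the relative fluctuations concrete, the sketch below evaluates the large in-degree form of the coefficient of variation, read here as $(1-q)\,[\langle k_{in}^2/k_{out} \rangle / (\langle k_{in} \rangle k_{in})]^{1/2}$. This reading of the formula, the toy list `degree_pairs`, and the choice to skip nodes without out-links when averaging $k_{in}^2/k_{out}$ are assumptions of this example.

```python
import math


def coefficient_of_variation(k_in, degree_pairs, q=0.15):
    """Large in-degree estimate of sigma / mean PageRank for in-degree k_in.

    degree_pairs is a list of (k_in, k_out) pairs, one per node.
    """
    n_nodes = len(degree_pairs)
    mean_k_in = sum(ki for ki, _ in degree_pairs) / n_nodes
    # Average of k_in^2 / k_out; nodes with k_out = 0 are skipped (assumption).
    moment = sum(ki * ki / ko for ki, ko in degree_pairs if ko > 0) / n_nodes
    return (1.0 - q) * math.sqrt(moment / (mean_k_in * k_in))


# Hypothetical (k_in, k_out) pairs for a four-node toy network.
degree_pairs = [(1, 2), (2, 1), (10, 4), (3, 3)]
cv_100 = coefficient_of_variation(100, degree_pairs)
cv_400 = coefficient_of_variation(400, degree_pairs)
```

Because the estimate scales as $k_{in}^{-1/2}$, quadrupling the in-degree halves the relative fluctuation, which is the sense in which the mean field value becomes more reliable for highly linked pages.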

3. RESULTS 

We analyzed four samples of the Web graph. Two of them 
were obtained by crawls performed in 2001 and 2003 by the 
WebBase collaboration [16]. The other two were collected
by the WebGraph project [S| using the UbiCrawler the 
pages belong to two national domains, .uk (2002) and .it
(2004), respectively. In Table 1 we list the total number of
vertices and edges and the average degree for each data set.

We calculated PageRank with the standard iterative pro- 
cedure; the factor q was set to 0.15, as in the original paper 
by Brin and Page and many successive studies. The con- 
vergence of the algorithm is very quick: in each case less 
than a hundred iterations were enough to determine the re- 
sult with a relative accuracy of 10~^ for each vertex. In 
Fig. 1 we show the distributions of PageRank. In all four



Table 1: Number of pages, links, and average degree
($\langle k \rangle = \langle k_{in} \rangle = \langle k_{out} \rangle$) for the four data sets we have
analyzed.

Data set    WB 2001       .uk 2002      WB 2003       .it 2004
# pages     8.1 × 10^7    1.9 × 10^7    4.9 × 10^7    4.1 × 10^7
# links     7.5 × 10^8    2.9 × 10^8    1.2 × 10^9    1.1 × 10^9
⟨k⟩         9.34          15.78         24.05         27.50


Table 2: Exponents $\beta$ of the power law part of the
PageRank distribution and linear correlation coefficients
$\rho$ between PageRank and in-degree.

Data set    WB 2001       .uk 2002      WB 2003       .it 2004
β           2.2 ± 0.1     2.0 ± 0.1     2.0 ± 0.1     2.0 ± 0.1
ρ           0.538         0.554         0.483         0.733

cases we obtained a pattern with a broad tail. The initial 
part of the distribution can be well fitted by a power law $p^{-\beta}$
with exponent $\beta$ between 2.0 and 2.2. This is in agreement
with the findings of refs. |lllll|. The right-most part of each 
curve, corresponding to the pages with highest PageRank, 
decreases faster. For the WebBase sample of 2001 the tail 
of the curve up to the last point can be well fitted by a 
power law with exponent $\beta \approx 2.6$; in the other cases we see
evidence of an exponential cutoff. 

We have also calculated the linear correlation coefficient
between PageRank and in-degree. In Table 2 we list Pearson's
$\rho$ together with the slopes of the power law portions
of the PageRank distributions. We see that the correlation
between PageRank and in-degree is rather strong, in contrast
to the findings of refs. [11] and especially [13], but in
agreement with ref. [10].
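For reference, the linear (Pearson) correlation coefficient between two equally long lists of values can be computed as in the following Python sketch. The two short lists are hypothetical numbers chosen only to exercise the function; they are not taken from our Web samples.

```python
import math


def pearson(xs, ys):
    """Pearson linear correlation coefficient between two sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical in-degrees and PageRank values for four pages.
in_degree = [1, 2, 3, 10]
page_rank = [0.1, 0.2, 0.3, 0.4]
rho = pearson(in_degree, page_rank)
```

The coefficient equals one only for an exactly linear relation; a monotonic but non-linear dependence, as in the toy lists above, gives a value below one.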

Figure 1: PageRank distributions.

Figure 2: Scatter plots of the empirical average
PageRank per degree class versus our mean field
(MF) estimate.

Figure 3: PageRank versus in-degree; the dashed
line is the approximation given by the closed formula
of Eq. (10).

Figure 4: Coefficient of variation of PageRank versus
in-degree.

We solved Eq. (8) with an iterative procedure analogous to
the one we used to calculate PageRank. We now look for
the vector $\bar{p}(\mathbf{k})$, defined for all pairs $\mathbf{k} = (k_{in}, k_{out})$ which
occur in the network. Since PageRank is a probability, it
must be normalized so that its sum over all vertices of the
network is one. So we initialized the vector with the constant
$\bar{p}_0(\mathbf{k}) = 1/N$, and plugged it into the right-hand side
of Eq. (8) to get the first approximation $\bar{p}_1(\mathbf{k})$. We then used
$\bar{p}_1(\mathbf{k})$ as input to get $\bar{p}_2(\mathbf{k})$, and so on. We remark that the
expression of the probability $P_{in}(\mathbf{k}'|\mathbf{k})$ is not a necessary
ingredient of the calculation. In fact, the sum on the right-hand
side of Eq. (8) is just the average value of $\bar{p}_{n-1}(\mathbf{k}')/k'_{out}$
among all predecessors of vertices with degree $\mathbf{k}$. The algorithm
leads to convergence within a few iterations (we never
needed more than 20). In Fig. 2 we compare the values of
$\bar{p}(\mathbf{k})$ calculated from Eq. (8) with the corresponding empirical
values. Here we averaged $\bar{p}(\mathbf{k})$ over out-degree, so it only
depends on the in-degree $k_{in}$. The variation of $\bar{p}(\mathbf{k})$ with
$k_{out}$ (for fixed $k_{in}$) turns out to be very small. The scatter
plots of Fig. 2 show that the mean field approximation gives
excellent results: the points are very tightly concentrated
about each frame bisector, drawn as a guide to the eye.
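The remark that the sum in the mean field equation is an average over predecessors suggests a direct implementation on an edge list, sketched below in Python. The function name, the toy edge list, the default number of iterations, and the final renormalization step are assumptions of this example rather than a description of the code used for our samples.

```python
from collections import defaultdict


def mean_field_by_class(edges, q=0.15, n_iter=100):
    """Iterate the class-averaged PageRank on a list of (source, target) links."""
    nodes = sorted({v for edge in edges for v in edge})
    n = len(nodes)
    k_in, k_out = defaultdict(int), defaultdict(int)
    for j, i in edges:
        k_out[j] += 1
        k_in[i] += 1
    node_class = {v: (k_in[v], k_out[v]) for v in nodes}
    p = {c: 1.0 / n for c in set(node_class.values())}  # constant start 1/N
    for _ in range(n_iter):
        total = defaultdict(float)  # sum of p(k')/k'_out over in-links of class k
        count = defaultdict(int)    # number of in-links pointing into class k
        for j, i in edges:
            total[node_class[i]] += p[node_class[j]] / k_out[j]
            count[node_class[i]] += 1
        p = {
            c: q / n + (1.0 - q) * c[0] * (total[c] / count[c] if count[c] else 0.0)
            for c in p
        }
    # Renormalize so that the values summed over all nodes give one.
    norm = sum(p[node_class[v]] for v in nodes)
    return {c: value / norm for c, value in p.items()}, node_class


# Hypothetical three-node graph; every node falls in its own degree class.
edges = [(0, 1), (1, 2), (2, 0), (0, 2)]
class_rank, node_class = mean_field_by_class(edges)
```

In this tiny example each degree class contains a single node, so the class average coincides with the PageRank of that node; on a large graph many nodes share a class and the iteration runs over far fewer unknowns than the full PageRank computation.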

We now analyze explicitly the relation between PageRank
and in-degree. To plot the function $p(k_{in})$ directly is
not very helpful because the wide fluctuations of PageRank
within each degree class would obscure the pattern for large
values of $k_{in}$. The best thing to do is to average PageRank
within bins of in-degree. As both PageRank and in-degree
are power-law distributed, we decided to use logarithmic
bins; the multiplicative factor for the bin size is 1.3.
The resulting patterns for our four Web samples are presented
in Fig. 3. The empirical curves are rather smooth,
and show that the average PageRank (per degree class) is an
increasing function of in-degree. The relation between the
two variables is approximately linear for large in-degrees.
This is exactly what we would expect if the degrees of pages
were uncorrelated with those of their neighbors in the Web
graph. In such a case the relation between PageRank and
in-degree is given by Eq. (10). Indeed, the agreement of the
empirical data with the curves of Eq. (10) in Fig. 3 is quite
good for all data sets. We infer that the Web graph is an
essentially uncorrelated graph; this is confirmed by direct
measurements of degree-degree correlations in our four Web
samples Ifi . What is most important, the average PageR- 
ank of a page with in-degree kin is well approximated by 
the simple expression of Eq. (10). The possible applications
of this result are examined in the next section. 
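The logarithmic binning used for the PageRank versus in-degree curves can be sketched as follows. Bin $b$ collects the in-degrees in $[1.3^b, 1.3^{b+1})$; the function name, the decision to leave out pages with zero in-degree, and the toy input lists are assumptions made for this illustration.

```python
import math
from collections import defaultdict


def log_binned_average(k_in_values, pagerank_values, factor=1.3):
    """Average PageRank within logarithmic bins of in-degree.

    Returns a list of (lower bin edge, mean PageRank) sorted by bin.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for k, p in zip(k_in_values, pagerank_values):
        if k < 1:
            continue  # log bins start at k_in = 1 (assumption of this sketch)
        b = int(math.log(k) / math.log(factor))  # index of the bin holding k
        sums[b] += p
        counts[b] += 1
    return [(factor ** b, sums[b] / counts[b]) for b in sorted(sums)]


# Hypothetical in-degrees and PageRank values for four pages.
binned = log_binned_average([1, 1, 2, 100], [0.1, 0.3, 0.5, 0.7])
```

With a multiplicative factor the bins widen in proportion to the in-degree, so the sparse high-degree region is averaged over enough pages to give a smooth curve.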

Let us analyze the empirical fluctuations of PageRank
about its mean value. We anticipated in Section 2.2 that we
expect large fluctuations for small values of $k_{in}$, due to the
large value of the second moment of the in-degree distribution,
and that the relative size of the fluctuations should
decrease as $k_{in}$ increases (Eq. (17)). Fig. 4 confirms our prediction.
We plotted the coefficient of variation $\sigma(k_{in})/\bar{p}(k_{in})$
as a function of $k_{in}$, once again averaging over out-degree.
The trend is clearly decreasing as $k_{in}$ increases. The fluctuations
of the data points are due to degree-degree correlations
(even if they are small, they are not completely negligible).
We also derived mean field estimates for the coefficient
of variation. Rather than solving the complete Eq. (14), we
used the coefficient of variation for an uncorrelated network,
given by the ratio between $\sigma(k_{in})$ from Eq. (15) and $\bar{p}(k_{in})$
from Eq. (10). Nevertheless, the agreement between our approximate
estimates and the empirical results in Fig. 4 is
very good except for high $k_{in}$, where we have an insufficient
number of points in each degree class, leading to high
fluctuations.

Finally, we test our prediction for the distribution of PageRank
within a degree class. In Fig. 5 we plot the PageRank
distributions for four classes, corresponding to in-degree 1,
10, 100 and 1000. The data refer to the WebBase sample of
2003, but we found the same trend for the other three data
sets. We see that for low in-degrees ($k_{in} = 1, 10$) the distribution
is a power law. The exponent is essentially the same
(2.3 ± 0.1, as for the other samples). However, for higher
degrees ($k_{in} = 100, 1000$), the distribution changes from a
power law to a hybrid distribution between a Gaussian and
a power law. The Gaussian is signaled by the peak, which in
the double logarithmic scale of the plot appears quite flat;
the power law is manifest in the long tail of the distribution.
The power law tails are all approximately parallel to each
other, i.e., the exponent is the same for all curves.

Figure 5: Distributions of PageRank for four degree
classes in the link graph from the WebBase 2003
crawl.

Figure 6: Dependence of the rank of a page on its
PageRank value.

4. APPLICATIONS TO THE LIVE WEB 

We have seen that the average PageRank of a page with
in-degree $k_{in}$ can be well approximated by the closed formula
in Eq. (10). We have also found that PageRank fluctuations
about the average become less important for larger
$k_{in}$. These two results suggest that for large enough $k_{in}$
the PageRank of a page with $k_{in}$ in-links only depends on
$k_{in}$, and that Eq. (10) should give at least the correct order
of magnitude for its PageRank. To use Eq. (10) for the Web
we need to know the total number $N$ of Web pages indexed
by Google and their average degree $\langle k_{in} \rangle$. The size of the
Google index was published until recently. We use the last
reported number, $N \approx 8.1 \times 10^9$. The average degree is
not known; the best we can do is extract it from samples
of the Web graph. Our data sets do not deliver a unique
value for $\langle k_{in} \rangle$, but they agree on the order of magnitude
(see Table 1). Hereafter we use $\langle k_{in} \rangle = 10$.

In this section we want to see whether Eq. (10) can be useful
in the live Web. Ideally we should compare the PageRank
values of a list of Web pages with the corresponding values
derived through our formula. Unfortunately the real
PageRank values calculated by Google are not accessible,
so we need a different strategy. The simplest choice is to
focus on rank rather than PageRank. We know that Google
ranks Web pages according to their PageRank values as well
as other features which do not depend on Web topology.
The latter features are not disclosed; in the following we
disregard them and assume for simplicity that the Google
ranking of a Web page exclusively depends on its PageRank
value. There is a simple relation between the PageRank $p$ of
a Web page and the rank $R$ of that page. The Zipf function
$R(p)$ is simply proportional to the cumulative distribution
of PageRank. Since the PageRank distribution is approximately
a power law with exponent $\beta \approx 2.1$ (see Section 3),
we find that

$$R(p) \simeq A\, p^{-\alpha} \tag{18}$$

where $\alpha = \beta - 1 \approx 1.1$ and $A$ is a proportionality constant.
Eq. (18) can be empirically tested. Fig. 6 shows the pattern
for the WebBase sample of 2003. The ansatz of Eq. (18) (with
$\alpha = 1.1$) reproduces the data for over three orders of magnitude.
The rank $R$ referred to above is the global rank of
a page of PageRank $p$, i.e., its position in the list containing
all pages of the Web in decreasing order of PageRank.
More interesting for information providers and search engine
marketers is the rank within hit lists returned for actual
queries, where only a limited number of result pages appear.
We need a criterion to pass from the global rank $R$ to the
rank $r$ within a query's hit list. A page with global rank $R$
could appear at any position $r = 1, 2, \ldots, n$ in a list with $n$
hits. In our framework pages differ only by their PageRank
values (or, equivalently, by their in-degrees), as we neglect
all semantic features. Therefore we can assume that each
Web page has the same probability of appearing in a hit list.
This is a strong assumption, but even if it may fail to describe
what happens at the level of an individual query, it is
a fair approximation when one considers a large number of
queries. Under this hypothesis the probability distribution
of the possible positions is a Poissonian, and the expected
local rank $r$ of a page with global rank $R$ is given by the
mean value:

$$r = R\, \frac{n}{N} \tag{19}$$
Now it is possible to test the applicability of Eq. (10) to
the Web. We are able to estimate the rank of a Web page
within a hit list if we know the number of in-links $k_{in}$ of the
page and the number $n$ of hits in the list. The procedure
consists of three simple steps:

1. from $k_{in}$ we calculate the PageRank $p$ of the page according
to Eq. (10);

2. from $p$ we determine the global rank $R$ according to
Eq. (18);

3. from $R$ and $n$ we derive the local rank $r$ according to
Eq. (19).

Figure 7: Density map of the scatter plot between
predicted rank $r_{est}$ and actual rank $r_{emp}$ for 65,207
queries. The fraction of points in each log-size bin is
expressed by the color, also on a logarithmic scale.
The diagonal guide to the eye corresponds to $r_{est} = r_{emp}$.

The combination of the three steps leads to the following
expression of the local rank $r$ as a function of $k_{in}$ and $n$:

$$r = \frac{A\, N^{\alpha - 1}\, n}{\left[ q + (1-q)\, k_{in} / \langle k_{in} \rangle \right]^{\alpha}} \tag{20}$$

The natural way to derive the parameter $A$ would be to
perform a fit of the empirical relation between global rank
and PageRank, as we did in Fig. 6. The result should then
be extrapolated to the full Web graph. As it turns out, the
value of $A$ derived in this way strongly depends on the sample
of the Web, so that one could do no better than estimating
the order of magnitude of $A$. On the other hand, $A$ is a
simple multiplicative constant, and its value has no effect on
the dependence of the local rank $r$ on the variables $k_{in}$ and
$n$. Therefore we decided to consider it as a free parameter,
whose value is to be determined by the comparison with
empirical data.

For our analysis we used a set of 65,207 actual queries
from a September 2001 AltaVista log 1 . We submitted each 
query to Google, and picked at random one of the pages of 
the corresponding hit list. For each selected page, we stored 
its actual rank $r_{emp}$ within the hit list, as well as its number
$k_{in}$ of in-links, which was again determined through Google.
The number $n$ of hits of the list was also stored. Google
(like other search engines) never displays more than 1000
results per query, so we always have $r_{emp} \leq 1000$. From
$k_{in}$ and $n$ we estimated the theoretical rank $r_{est}$ by means
of Eq. (20) and compared it with its empirical counterpart
$r_{emp}$. The comparison can be seen in the scatter plot of
Fig. 7. Given the large number of queries and the broad
range of rank values, we visualize the density of points in
logarithmic bins. The region with highest density is a stripe
centered on the diagonal line $r_{est} = r_{emp}$ by a suitable choice
of A (A = 1.5 X 10"''). We conclude that the rank derived 
through Eq. (20) is in most cases close to the empirical one.
We stress that this result is not trivial, because (i) Web
pages are not ranked exclusively according to PageRank;
(ii) we are neglecting PageRank fluctuations; and (iii) not all
pages have the same probability of being relevant
with respect to a query.

5. DISCUSSION 

The present study motivates further enquiries. The mean
field approach provides a simple functional relationship between
average PageRank, in-degree, and degree-degree correlations.
The price one pays by using such a simple approximation
is the neglect of the significant fluctuations of
PageRank values around the mean field average within a
degree class. For the majority of pages, having moderate
PageRank, fluctuations are more important; the in-degree
being equal, they make the difference between being linked
by "good" or "bad" pages. An avenue we intend to pursue is
to understand what makes the difference between two pages
with the same in-degree and a very different value of PageRank,
and how pages with higher PageRank are differently
positioned in the complex architecture of the Web graph.

The approach described here lends itself naturally to ap- 
plications other than the Web, e.g., bibliometry. Commonly 
the quality of papers is assessed via the number of citations 
they receive, and it would be useful to be able to rank papers 
with the same number of citations through their PageRank 
values. A characterization of papers leading to high PageR- 
ank fluctuations would be useful in this domain as well. 

A promising way to study fluctuations at a moderate price
in increased complexity could be to use the definition of
Eq. (1), where the value of PageRank on the right-hand side is
substituted by the mean field approximation. Further work
is needed in this direction.

In this paper we have quantitatively explored two key as- 
sumptions around the current search status quo, namely 
that PageRank is very different from in-degree due to its 
global nature and that PageRank cannot be easily guessed or 
approximated without global knowledge of the Web graph. 
We have shown that, due to the weak degree-degree correlations
in the Web link graph, PageRank is strongly correlated
with in-degree and thus the two measures provide
very similar information: the PageRank factor used by
search engines to rank pages can be effectively replaced by
in-degree, especially for the most popular pages. Further,
we have introduced a general mean field approximation of
PageRank that, in the specific case of the Web, allows one to
estimate PageRank from only local knowledge of in-degree.
We have further quantified the fluctuations of this approx- 
imation, gauging the reliability of the estimate. Finally we 
have validated the approach with a simple procedure that 
predicts how actual Web pages are ranked by Google in re- 
sponse to actual queries, using only knowledge about in- 
degree and the number of query results. 

Our method has immediate application for information 
providers. For instance, the association between rank and 
in-degree allows one to deduce how many in-links would be 
needed for a new page to achieve a given rank among all 
pages that deal with the same topic. This is an issue of 
crucial economic impact: all companies that advertise their 
products and services online wish for their homepages to 
belong among the top-ranked sites in their business sector. 
Suppose that someone wants their homepage H to appear 
among the top n pages about topic T. Our recipe is ex- 
tremely simple and cheap, requiring the submission of two 
queries to the search engine: 

1. submit a query to Google (or another search engine)
about topic T;

2. find the number $k_n$ of in-links for the $n$-th page in the
resulting hit list;

3. H needs at least $k_n$ in-links to appear among the first
$n$ hits for topic T.

Of course there are limits to this approach; we do not claim 
that the lower bound of the number of in-links can be taken 
as a safe guide. Indeed we have neglected important factors 
such as the role of page content in retrieving and ranking 
results, and the fluctuations of the mean field approximation 
of PageRank. 

Notwithstanding the above caveats, our results indicate 
that at least the order of magnitude should be a reliable ref- 
erence point. This may be all that is necessary — knowing 
the difference between the need for one thousand or one mil- 
lion links can be a crucial asset in planning and budgeting a 
marketing campaign. Is word of mouth sufficient, or is
advertising required? Our approach provides a tool to answer
this kind of question. In making such a tool available to
search engine marketers and information providers alike, we 
hope to create a more level playing field so that not only 
large and powerful organizations but also small communi- 
ties with little or no marketing budget can make informed 
decisions about the management of their Web presence. 